A Comparison of Best Model Selection Criteria in Multiple Regression

Multiple regression permits model testing wherein a set of
independent variables is hypothesized to predict a dependent
variable. Often, when the selected set of variables does not
predict significantly, the researcher searches for a "subset" of
variables that provides the best prediction model. The various
multiple regression stepwise methods have been used extensively
for this purpose. Prior research, however, has indicated that
the all possible subset approach is preferred over the stepwise
methods in determining the best model (Berk, 1978; Thayer, 1986;
Davidson, 1988; Henderson & Denison, 1989; Welge, 1990; Thayer,
1990). Thompson et al. (1991), in further criticizing stepwise
methods, recommended that effect sizes be computed for each "all
possible" subset equation and that the subset model which has the
desired effect size be chosen.

Zuccaro (1992) investigated the use of the $C_p$ criterion in
contrast to the stepwise methods for determining the best set of
predictors using a sample data set. The $C_p$ statistic measures
the total squared error variance in each subset model containing
p predictor variables (error variance plus the bias introduced by
not including important variables). Findings suggested that the
best subset model, the one with the lowest bias, is
indicated by the smallest Mallows' $C_p$ value (Mallows, 1966;
1973), especially in the presence of multicollinearity. Pohlmann
(1983) noted that multicollinearity among predictor variables
did not affect the Type I error rate, but did affect the Type II
error rate and the width of the confidence interval. Findings
suggested that sample size and model validity could compensate
for multicollinearity effects, especially when certain research
questions required models with highly correlated predictors, e.g.

$Y = \beta_1 X_1 + \beta_2 X_2 + e$.

The principal component regression (PCR) approach has also
been proposed as a criterion for selecting the best predictor
model. The method appears to be useful when predicting values in
one sample based upon estimates from another sample and when
multicollinearity exists among a set of variables (Morrison,
1976). The PCR approach is justified when the mean
squared error of a biased estimate is smaller than the variance
of an unbiased estimate. The PCR method, however, is not
appropriate for multiple regression subset models containing
interactions (Aiken & West, 1993), nor for models that depict
nonlinear correlated predictor sets. The PCR method creates a
set of new variables called principal components, which are
uncorrelated (orthogonal); this precludes its use
in these types of models.

A review of the literature indicated that researchers misuse
stepwise methods to determine the best predictor set or to interpret
the importance of predictor variables (Huberty, 1989; Snyder,
1991; Thompson, 1989; Thompson et al., 1991; Welge, 1990).
Stepwise methods inflate Type I error rates by not using the
correct degrees of freedom in calculating the change in $R^2$.
Additionally, researchers incorrectly interpret the order of
variable selection as defining the "best set" of variables in the
predictor set. Also, the order of variable entry is
misinterpreted as determining which variables are the important
predictors.

The all possible subset approach is recommended as an
alternative to stepwise methods for selecting the best set of
predictor variables. Several criteria, however, are available
for selecting the best subset model: $R^2$, Adj. $R^2$, MSE, $C_p$, or
the principal component regression method. How do these criteria
compare when selecting the best subset model? When might a
researcher choose one criterion over another for selecting the
best model? The principal component regression method, which
determines the best model for prediction by redefining the
theoretical model, appears more useful when estimates from one
sample are used to predict in another sample. The $C_p$ statistic
is useful when the predictor set is correlated, whereas the
principal component method creates a set of orthogonal
predictors. A comparison of the selection criteria and the PCR
approach will permit an investigation of their usefulness for
subset model selection. An applied example will illustrate a
comparison of the criteria and afford further discussion. The
objective of this study, therefore, was to compare the various
model subset selection criteria and provide guidelines for the
selection of the best "subset" model.

METHODS AND PROCEDURES

Subjects 

Subjects came from a cohort of students accepted into the 
Texas Academy of Mathematics and Science (TAMS) at the University 
of North Texas in Fall, 1993. TAMS is an early college entrance 
program in which students earn approximately 60 hours of college 
credit by taking University of North Texas courses. Students 
enter TAMS at the beginning of their 11th year in high school. 
They live on campus in a special residence hall and take regular 
university courses in mathematics, science and the humanities. 
After two years, participants receive a special high school 
diploma and have amassed at least 60 hours of college credit. 
Each year approximately 200 high school sophomores, who have met 
the selection criteria and completed the 10th grade, are accepted 
into the Texas Academy of Mathematics and Science. 

In the study year, TAMS accepted 204 students. Of these,
156 students attended an August orientation, which occurred a
week prior to their first semester of college coursework, and
completed the Learning and Study Strategies Inventory (LASSI).
There were 80 females and 76 males who
participated in the study. The students who took the LASSI were
similar in demographic background and academic ability to
previous classes because of the academy's consistent admission
requirements and pool of applicants. The participants' SAT-M and
SAT-V means and standard deviations, respectively, were: M=651,
SD=57; and M=530, SD=75.

Instrument

The LASSI is an English-language assessment tool designed to
measure college students' use of learning and study strategies. 
It was designed to provide assessment and pre-post achievement 
measures for students participating in a learning strategies and 
study skills project. A high-school version is available, but it 
was not recommended for use with accelerated students in these 
programs (Eldredge, 1990) . The LASSI can be administered in a 
group setting in approximately 30 minutes. The carbonless test 
format allows participants to score their own assessment and take 
a copy of the results with them from the testing session. 

The LASSI's ten subscales focus on thoughts and behaviors
related to successful learning. The ten subscales are (1)
attitude; (2) motivation; (3) time management; (4) anxiety; (5)
concentration; (6) information processing; (7) selecting the main
ideas; (8) study aids; (9) self-testing; and (10) test strategies
(for more details, see Weinstein, 1987). Reliability studies
reported Cronbach alpha internal consistency values ranging from
.70 to .86 and test-retest reliabilities from .70 to .85.
Validity studies have also reported normative data for high
school and college students with different instruments for each
group (Weinstein, Palmer, & Schulte, 1987). Students respond to
individual items on each subscale using a five-point scale: (5)
very typical of me; (4) fairly typical of me; (3) somewhat
typical of me; (2) not very typical of me; and (1) not at all
typical of me. Some item values are reverse keyed before being
added to obtain a subscale score. The subscale scores are
compared by graphing them onto a normal curve equivalent
percentile chart.

According to the LASSI user's manual (Weinstein, 1987), 
students scoring above the 75th percentile do not need to improve 
that specific skill or strategy. Students scoring between the 
75th percentile and the 50th percentile should consider
improvement. Students scoring below the 50th percentile on a 
subscale need assistance to improve that skill or strategy. For 
example, students scoring below the 50th percentile on the 
anxiety subscale would be considered anxious about being in 
college. Likewise, students scoring below the 50th percentile on 
the motivation subscale lack appropriate motivation to do college 
level work effectively. 

Research Question 

The research question of interest was whether the ten LASSI 
subscales could predict a student's college grade point average 
after one semester of college coursework. A related question 
pertained to whether a "subset" of the ten LASSI subscales could 
better predict college grade point average for this sample of 
students. Students not maintaining at least a 2.50 grade point
average after one semester of college coursework were dismissed
from the Academy. Knowledge of which subscales are the best
predictors of college grade point average will aid staff in
identifying potential at-risk students upon entering the Academy.

Data Analysis

The SAS statistical program is in the Appendix. The
student's college grade point average was predicted by the ten
LASSI subscales with the SELECTION= option requesting the best
subset model criteria. A monotonic relationship existed between
$R^2$ and Adjusted $R^2$; in fact, r = 1.00 across all subset sizes. A
monotonic relationship also existed between $C_p$ and MSE; in fact,
r = 1.00 across all subset sizes. In addition, both $R^2$ and
Adjusted $R^2$ had a perfect inverse relationship to both $C_p$ and
MSE, i.e., r = -1.00, across all subset sizes.

The $R^2$ statistic is calculated as $SS_{regression}/SS_{total}$ and is
the most commonly used and reported value in determining the
proportion of variance in the dependent variable accounted for by
the independent variables (Pedhazur, 1982). The adjusted $R^2$
value, which corrects for the number of predictors in the model,
is computed as $R^2 - [p(1 - R^2)/(N - p - 1)]$ (Norusis, 1979). In
determining the number of variables to include in the regression
model, the researcher will typically test for significant
increments in $R^2$ values between models with differing numbers of
predictors. The mean square error (MSE) value is an unbiased
estimate of $\sigma^2$, the variance of $e$, which represents random error
and accounts for variation due to other factors. The MSE
statistic is computed as $SS_{error}/df_2$.
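
As a rough illustration of the three statistics defined in this
paragraph, the following Python sketch computes them from sums of
squares. The function names are illustrative assumptions and are not
part of the SAS program in the Appendix; the error degrees of freedom
are taken to be N - p - 1, which assumes a model with an intercept.

```python
def r_squared(ss_regression, ss_total):
    """Proportion of variance in the dependent variable explained."""
    return ss_regression / ss_total


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for p predictors and n cases: R^2 - p(1 - R^2)/(n - p - 1)."""
    return r2 - p * (1.0 - r2) / (n - p - 1)


def mean_square_error(ss_error, n, p):
    """MSE = SS_error / df_2, assuming df_2 = n - p - 1 (intercept included)."""
    return ss_error / (n - p - 1)
```

With the full-model value reported in the Results (an R-squared of
.19 for ten predictors) and assuming N = 156 cases,
adjusted_r_squared(0.19, 156, 10) is about .13, which agrees with the
adjusted value reported there.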

The Mallows' $C_p$ statistic with the intercept term is
calculated as $C_p = [(1 - R_p^2)(n - T)/(1 - R_T^2)] - (n - 2p)$, or alternatively
without the intercept term as $C_p = (SSE_p/MSE) - (n - 2p) + 1$
(Freund & Littell, 1991). Mallows' $C_p$ is useful in measuring the
level of bias in the parameter estimates ($\beta_j$). The $C_p$ criterion
has also been recommended for determining the best set of
predictors. The PROC PRINCOMP procedure was used to create ten
orthogonal principal component variables. The principal
component variable parameter estimates were computed using the
PROC REG procedure. The number of significant principal component
parameter estimates was then identified. These procedures are
outlined in the SAS System for Regression manual (Freund &
Littell, 1991).
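
The first form of the Mallows statistic given above can be sketched in
Python as follows. This is a minimal sketch under stated assumptions,
not the SAS computation: the function name is hypothetical, and p and
t are assumed to count the parameters (intercept included) in the
subset model and in the full model, respectively.

```python
def mallows_cp(r2_subset, r2_full, n, p, t):
    """Mallows' C_p from squared multiple correlations.

    Assumption: p and t count parameters (intercept included) in the
    subset model and the full model; n is the sample size.
    """
    return (1.0 - r2_subset) * (n - t) / (1.0 - r2_full) - (n - 2 * p)
```

Under this convention the full model always returns a value equal to
its own parameter count (the first term reduces to n - t, so the
result is t), which is why subset values are judged against a
reference line rather than against zero.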

RESULTS 

The optimum subset model should generally be one that
produces the minimum mean square error (MSE), or equivalently
maximizes the $R^2$ value. The procedure for finding the optimum
subset of all possible subset sizes requires computing $2^m$
equations. The ten subscale predictors in the model yielded 1024
regression equations ($2^{10}$) with associated selection criteria
statistics. (Note: The number of subset
equations generated for p predictor variables from an m variable
full model is $m!/[p!(m - p)!]$. For example, the number of 2
variable subset equations (p) generated from a 10 variable model
(m) would be 45.)
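
The subset counts in this paragraph can be checked with a few lines of
Python. The variable names are illustrative; only the standard library
is used.

```python
from itertools import combinations
from math import comb

m = 10  # number of predictors in the full model

# Every predictor is either in or out of a model, so there are 2^m subsets
# (this count includes the empty, intercept-only model).
total_models = 2 ** m

# Number of subset models of each size p: m! / [p!(m - p)!]
models_by_size = {p: comb(m, p) for p in range(m + 1)}

# Enumerating the two-variable subsets explicitly gives the same count.
pairs = list(combinations(range(1, m + 1), 2))

print(total_models)       # 1024
print(models_by_size[2])  # 45
print(len(pairs))         # 45
```

Summing the counts over all subset sizes recovers the same total of
1024 equations.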

The correlation matrix, means and standard deviations of the
ten LASSI subscales are in Table 1. The intercorrelations among
the subscales indicated that Anxiety/Worry was not significantly
correlated with Time Management, Information Processing, Support
Techniques/Materials, and Self Testing/Class Preparation. The
lowest subscale mean was on Selecting Main Ideas.



Insert Table 1 Here 



The best subset model for each subset size and the
corresponding selection criteria are in Table 2. A combined
criterion, minimum mean square error (MSE) with maximum $R^2$,
indicated a five variable subset model. In contrast, the $C_p$
criterion indicated a four variable model. The four variable
subset model for predicting college grade point average consisted
of the four subscales: Motivation, Anxiety/Worry, Support
Techniques/Materials, and Self Testing/Class Preparation. The
fifth variable indicated in the combined criteria selection was
Information Processing.



Insert Table 2 Here 



The $C_p$ criterion also indicated the bias in having too many
variables in the model. Large $C_p$ values indicated equations with
larger mean square error. If $C_p > (p + 1)$, for any subset size
p, then bias was present. If $C_p < (p + 1)$, for any subset size
p, then the model contained too many variables. A plot of the $C_p$
values against the number of predictors, compared to a plot of
the (p + 1) values, has been recommended for determining the best
subset model (Mallows, 1973).
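
The comparison rule stated in this paragraph can be written as a small
Python helper. This is only a sketch: the function name, the return
labels, and the tol argument (a tolerance for treating the statistic
as equal to p + 1) are hypothetical and do not come from the text.

```python
def classify_cp(cp, p, tol=0.0):
    """Compare C_p with p + 1 for a subset of p predictors.

    tol is a hypothetical tolerance (not in the text) for treating
    C_p as equal to p + 1.
    """
    target = p + 1
    if cp > target + tol:
        return "bias present"
    if cp < target - tol:
        return "too many variables"
    return "C_p equals p + 1"
```

Applying the helper to each subset size in turn gives the same
information as reading the plot of the statistic against the (p + 1)
line.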

The present pattern of $C_p$ values for the various subsets of
size p is typical when multicollinearity is present. The $C_p$
values initially become smaller, but then start to increase. The
plot of $C_p$ values is similar to a "scree" plot in factor analysis,
and as such a multiple regression method might also be useful in
determining the number of variables to retain (Zoski & Jurs,
1993). The best subset model is indicated when the $C_p$ values
begin to increase and cross the (p + 1) values (see Figure 1).



Insert Figure 1 Here 



Principal components were obtained by computing eigenvalues
from the correlation matrix. The correlation matrix was used so
that variables were not affected by the scale of measurement, as
they would be with a variance-covariance matrix. Since eigenvalues
are the variances of the principal component variables, the sum
of the eigenvalues equals the number of variables in the full
model, just as the sum of standardized variable variances would
equal the number of variables. This sum is the measure of the
total variation in the data set.
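
The statement that the eigenvalues of a correlation matrix sum to the
number of variables can be checked by hand in the smallest case. For
two standardized variables with correlation r, the eigenvalues are
1 + r and 1 - r. The Python sketch below is illustrative only (the
function name is an assumption, and it covers two variables rather
than the ten subscales).

```python
import math


def eigenvalues_2x2_correlation(r):
    """Eigenvalues of [[1, r], [r, 1]] from the characteristic equation."""
    trace = 2.0                # sum of the diagonal (two unit variances)
    determinant = 1.0 - r * r
    root = math.sqrt(trace * trace / 4.0 - determinant)
    return trace / 2.0 + root, trace / 2.0 - root


for r in (0.0, 0.5, 0.9):
    largest, smallest = eigenvalues_2x2_correlation(r)
    # The sum is always 2, the number of variables; the spread grows with r.
    print(r, round(largest, 2), round(smallest, 2), round(largest + smallest, 2))
```

As the correlation between the two variables rises, one eigenvalue
grows toward 2 and the other shrinks toward 0 while their sum stays
fixed.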

A wide variation in the eigenvalues would suggest the
presence of multicollinearity among the variables. The number of
eigenvalues greater than unity, as in factor analysis, would
indicate the number of variables from the full model that would
explain most of the variance in the data set. The eigenvectors,
in contrast, contain the coefficients for each principal
component variable. These coefficients are applied to the
observed values of the original variables to create the values
of the principal component variables. These values
are then used in multiple regression as orthogonal, uncorrelated
predictor values with no multicollinearity present. Table 3
contains the eigenvalues for the ten principal component
variables generated from the correlation matrix of the ten LASSI
subscales.



Insert Table 3 Here 



Preliminary inspection of the eigenvalues indicates three
principal component variables that account for 72.4% of the
variance in predicting college grade point average (7.24/10.00).
The first principal component alone explained 46%. The ten
principal component variables when analyzed in multiple
regression yielded an $R^2$ = .19, Adj. $R^2$ = .13, and an MSE = .32,
which is identical to the ten predictor model using the original
variables obtained from the subset model approach (Table 4).



Insert Table 4 Here 




A summary of parameter estimates in Table 5 indicates that 
model components 1, 4, and 8 are significant relative to other 
principal components in the full model. An examination of the 
coefficients in the eigenvectors for these principal components 
reveals which subscales contribute the most to the prediction of 
college grade point average (see Table 6). 



Insert Table 5 Here



Insert Table 6 Here



The first principal component indicates that all subscales
contributed to prediction. The fourth principal component
indicates that Attention, Anxiety/Worry, and Information
Processing are important. The eighth principal component
comprised Attention, Motivation, Time Management, Concentration,
Information Processing, and Support Techniques/Materials. The
fourth and eighth principal components suggested "factors" which
are secondarily related to the primary construct tapped by the
LASSI subscales. Providing names for the principal components,
as in factor analysis, is subjective and only meaningful within
the context of interpreting scores. These principal component
results clearly indicate that all ten subscales, when treated as
uncorrelated or orthogonal predictors, contributed to the
prediction of future college grade point average the same as the
set of ten correlated predictors in the ordinary least squares
approach.

SUMMARY 

The $C_p$ criterion was the lowest for a four variable predictor
model. This four variable subset model was verified by examining
where the plot of $C_p$ values against the (p + 1) values crossed.
The MSE criterion was the lowest for the five variable model.
The $R^2$ and Adjusted $R^2$ criteria also indicated a five variable
model. The full model with all ten subscales as predictors
yielded the same result as the principal components method using
the first principal component extracted. The first principal
component yields the most variance accounted for in the set of
variables. The $C_p$ criterion selected the smallest variable subset
model in the presence of multicollinearity.

In using multiple regression it is important to have a
theoretical basis for the regression model and to consider sample
to sample fluctuations in $R^2$. A common misconception in multiple
regression is that the model with all the significant predictors
included is the best model. This is not always the case. The
problem is that the b's and $R^2$ values are data dependent due to
the least squares criterion being applied to a specific sample of
data. A different sample will usually result in different
parameter estimates and variance explained. Although the
standard errors of the b's do provide the researcher with some
indication of the amount of change expected from sample to
sample, the fact remains that the estimates obtained from one
sample may predict poorly when applied to a new set of sample
data. The primary method to assess the change in $R^2$ or b's is to
replicate the regression model using other sample data.
Bootstrapping, jackknifing, and cross-validation methods have also
become useful in indicating the variation in b's and $R^2$ values
when estimates from one sample are applied to another sample.
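
To make the resampling idea concrete, the following Python sketch
bootstraps the squared correlation of a one-predictor regression. It
is a simplified illustration under several assumptions: the data are
simulated (they are not the TAMS sample), a single predictor stands in
for the ten subscales, and the function names, the seed values, and
the 0.4 slope are hypothetical choices.

```python
import random


def r_squared_simple(x, y):
    """R^2 for a one-predictor least squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # degenerate resample with no variation
    return (sxy * sxy) / (sxx * syy)


def bootstrap_r_squared(x, y, replicates=1000, seed=0):
    """Resample cases with replacement and recompute R^2 each time."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(replicates):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(r_squared_simple([x[i] for i in idx], [y[i] for i in idx]))
    return values


# Simulated data (not the TAMS sample): y depends weakly on x plus noise.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(156)]
y = [0.4 * xi + rng.gauss(0.0, 1.0) for xi in x]

boot = sorted(bootstrap_r_squared(x, y))
print("sample R^2:", round(r_squared_simple(x, y), 3))
print("middle 95% of bootstrap R^2:", round(boot[25], 3), "to", round(boot[974], 3))
```

The width of the printed interval gives a rough sense of how much the
variance explained could shift from one sample to another of the same
size.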

The rationale behind a regression model is to estimate $\sigma^2$
(the true model's mean square error variance). Since $\sigma^2$ is not
generally known, a researcher must estimate it from a knowledge
of prior research ($\sigma^2 = \sigma^2_{y \cdot x}$), obtain estimates from a model
containing all theoretically relevant predictors, replicate the
study, or use bootstrapping, jackknifing, and cross-validation
methods. In this regard, effect size considerations, as
recommended by Thompson et al. (1991), become important to consider.